Abstract. We present two new methods for estimating the order (memory
depth) of a finite alphabet Markov chain from observation of a sample 
path. One method is based on entropy estimation via recurrence times 
of patterns, and the other relies on a comparison of empirical conditional 
probabilities. The key to both methods is a qualitative change that occurs 
when a parameter (a candidate for the order) passes the true order. We 
also present extensions to order estimation for Markov random fields.

1 Introduction



Fix a finite set $A$ and let $x_m^n$ denote the sequence $x_m, x_{m+1}, \ldots, x_n$, where $x_i \in A$.
A stationary, ergodic, $A$-valued process $X = \{X_n\}$ is Markov of order $M = 0$ if it
is i.i.d., and Markov of order $M > 0$ if $M$ is the least positive integer such that
$P(a_{k+1} \mid a_1^k) = P(a_{k+1} \mid a_{k-M+1}^k)$ for all $a_1^{k+1}$ such that $k \ge M$. A consistent Markov
order estimator is a sequence of functions $M_n^*\colon A^n \to \{0, 1, \ldots\}$, $n \ge 1$, such that for
any $M$ and any Markov process $X$ of order $M$,

$$\lim_{n\to\infty} M_n^*(x_1^n) = M, \quad \text{a.s.}$$

(Here and throughout, "a.s." always refers to the distribution $P = P_X$ of $X$.) In this
paper we introduce two new Markov order estimators. Both use test functions that 
depend on the sample size and a candidate k for the order. The key to our methods is 
that as k increases, our test functions exhibit a qualitative change of behavior when 
k reaches the true order. 

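Our estimators use the empirical frequencies of overlapping blocks,

$$N_n(a_1^k) = N_n(a_1^k \mid x_1^n) = \bigl|\{i \in [0, n-k] : x_{i+1}^{i+k} = a_1^k\}\bigr|. \tag{1}$$

The corresponding empirical probabilities and conditional probabilities are

$$\hat P_n(a_1^{k+1}) \stackrel{\mathrm{def}}{=} \frac{1}{n-k}\, N_n(a_1^{k+1}) \quad\text{and}\quad \hat P_n(a_{k+1} \mid a_1^k) \stackrel{\mathrm{def}}{=} \frac{N_n(a_1^{k+1})}{N_{n-1}(a_1^k)}.$$

We also define the $k$-step conditional empirical entropy,

$$\hat h_k(n) = -\sum_{a_1^{k+1}} \hat P_n(a_1^{k+1}) \log \hat P_n(a_{k+1} \mid a_1^k).$$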
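The block counts and the conditional empirical entropy can be computed directly from a sample path. The following Python sketch is only an illustration, not the authors' implementation: the names `block_counts` and `empirical_conditional_entropy` are hypothetical, logarithms are natural, and the conditional probability is normalized by the count of the context inside $x_1^{n-1}$, which is an assumption about the intended normalization.

```python
from collections import Counter
from math import log


def block_counts(x, k):
    """Count overlapping length-k blocks of the sequence x (string or list)."""
    return Counter(tuple(x[i:i + k]) for i in range(len(x) - k + 1))


def empirical_conditional_entropy(x, k):
    """k-step conditional empirical entropy of x, in nats."""
    n = len(x)
    if n <= k:
        raise ValueError("need len(x) > k")
    joint = block_counts(x, k + 1)      # counts of (k+1)-blocks; there are n - k of them
    context = block_counts(x[:-1], k)   # counts of k-blocks inside x_1^{n-1}
    h = 0.0
    for block, count in joint.items():
        p_joint = count / (n - k)               # empirical block probability
        p_cond = count / context[block[:-1]]    # empirical conditional probability
        h -= p_joint * log(p_cond)
    return h
```

For a strictly alternating binary sequence the order-0 value is close to $\log 2$ while the order-1 value is 0, which matches the intuition that one symbol of memory determines the next symbol.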

Our first method, which we call the entropy estimator method, compares $\hat h_k(n)$
with the entropy estimator $[\ell(n)]^{-1} \log n$, where $\ell(n)$ denotes the length of the longest
initial block in $x_1^n$ that repeats in $x_1^n$ (see [13] and Section 2 below).

Theorem 1 $M_n^*(x_1^n) = \min\{k : \hat h_k(n) < [\ell(n)]^{-1} \log n + 2(\log n)^{-1/4}\}$ is a consistent
Markov order estimator.



Our second method, which we call the maximal fluctuation method, is based on
the test function

$$\phi_m(x_1^n) \stackrel{\mathrm{def}}{=} \max_{m < k < f(n)} \; \max_{a_1^k \in A^k} \Bigl[ N_n(a_1^k) - N_{n-1}(a_1^{k-1})\, \hat P_n(a_k \mid a_{k-m}^{k-1}) \Bigr], \tag{2}$$

where $f(n) = \log\log n$. Define $M_n^*(x_1^n) = \min\{m < n - f(n) : \phi_m(x_1^n) < n^{3/4}\}$; if the
set we are minimizing over is empty, then we take $M_n^*(x_1^n) = n$.

Theorem 2 $M_n^*(x_1^n)$ is a consistent Markov order estimator.
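To make the maximal fluctuation rule concrete, here is a minimal Python sketch under stated assumptions. The names `max_fluctuation` and `fluctuation_order_estimate` are hypothetical, the cutoff on block lengths is taken as the integer part of $\log\log n$ (the rounding is an assumption), and the empirical conditional probability of a symbol given its $m$ preceding symbols is computed as a ratio of overlapping block counts. It is an illustration of the rule, not the authors' code.

```python
from collections import Counter
from math import floor, log


def block_counts(x, k):
    """Count overlapping length-k blocks of x."""
    return Counter(tuple(x[i:i + k]) for i in range(len(x) - k + 1))


def max_fluctuation(x, m, k_max):
    """Largest fluctuation over block lengths m < k <= k_max (-inf if there are none)."""
    joint_m = block_counts(x, m + 1)       # counts of (length-m context, next symbol)
    context_m = block_counts(x[:-1], m)    # counts of length-m contexts inside x_1^{n-1}
    best = float("-inf")
    for k in range(m + 1, k_max + 1):
        blocks = block_counts(x, k)
        prefixes = block_counts(x[:-1], k - 1)
        # Only observed blocks are scanned: an unobserved block contributes a
        # value <= 0, which cannot exceed the maximum over observed blocks.
        for a, count in blocks.items():
            p_hat = joint_m[a[k - m - 1:]] / context_m[a[k - m - 1:k - 1]]
            best = max(best, count - prefixes[a[:-1]] * p_hat)
    return best


def fluctuation_order_estimate(x, k_max=None):
    """Smallest m whose fluctuation stays below n^(3/4); n if no such m exists."""
    n = len(x)
    if n < 3:
        raise ValueError("need at least 3 symbols so that log log n is defined")
    if k_max is None:
        k_max = floor(log(log(n)))         # assumed rounding of f(n) = log log n
    for m in range(0, max(n - k_max, 0)):
        if max_fluctuation(x, m, k_max) < n ** 0.75:
            return m
    return n
```

Because the cutoff grows like $\log\log n$, only very small candidate orders are actually tested at realistic sample sizes; passing a larger `k_max` by hand is a convenient way to experiment, at the cost of leaving the regime covered by the theorem.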

A more general form of Theorem 1 that allows any entropy estimator with a
known rate of convergence is given in Section 2. An extension of Theorem 2 to
Markov random fields is given in Section 3.1. Connections to other model selection
methods are given in Section 4.

Careful proofs of Theorems 1 and 2 are given in Sections 2 and 3, respectively.
For the reader's convenience, we first present sketches of the proofs.

Sketch of proof of Theorem 1. The $2(\log n)^{-1/4}$ term incorporates the rate of
convergence of $[\ell(n)]^{-1} \log n$ and that of $\hat h_M(n)$ to their common almost sure limit,
the entropy $H(X)$ of $X$. Thus $\hat h_M(n) < [\ell(n)]^{-1} \log n + 2(\log n)^{-1/4}$, eventually a.s.,
whence $M_n^* \le M$ eventually a.s. On the other hand, if $k < M$ then $\hat h_k(n)$ converges
a.s. to the $k$-step conditional theoretical entropy $H_k(X)$, which exceeds $H(X)$, the
almost sure limit of $[\ell(n)]^{-1} \log n + 2(\log n)^{-1/4}$. Therefore $M_n^* \ge M$ eventually a.s.

Sketch of proof of Theorem 2. If $m < M$ then there exists $a_1^{M+1}$ such that
$P(a_{M+1} \mid a_1^M) > P(a_{M+1} \mid a_{M-m+1}^M)$, and hence $\phi_m(x_1^n)$ grows a.s. like $cn$, for some
$c > 0$. Thus $M_n^* \ge M$ eventually a.s. On the other hand, classical large deviations
theory shows that for any $\epsilon > 0$, we have $\phi_M(x_1^n) = o(n^{1/2+\epsilon})$, a.s., so $M_n^* \le M$
eventually a.s.



2 The entropy estimator method. 

We first review some elementary facts about entropy; see [H] or [17] for details.
The conditional entropy of the next symbol given $k$ previous symbols is defined by

$$H_k = H(X_{k+1} \mid X_1^k) \stackrel{\mathrm{def}}{=} -\sum_{a_1^{k+1}} P(a_1^{k+1}) \log P(a_{k+1} \mid a_1^k).$$

The sequence $\{H_k\}$ is nonincreasing with limit equal to the entropy $H = H(X)$ of
the process. Furthermore, the process is Markov of order $M$ if and only if

$$k < M \implies H_k > H(X) \quad\text{and}\quad k \ge M \implies H_k = H(X),$$

that is, if and only if $H_k$ reaches its limit $H$ exactly when $k = M$; see [17, Thm 1.6.11].
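The behaviour of the conditional entropies can be checked numerically on a small example. The sketch below computes them for a two-state first-order chain by direct enumeration of blocks; the transition matrix, the helper names `block_prob` and `conditional_entropy`, and the use of natural logarithms are all illustrative assumptions.

```python
from itertools import product
from math import log


def block_prob(block, pi, P):
    """Probability of a block under a stationary first-order chain (pi, P)."""
    p = pi[block[0]]
    for a, b in zip(block, block[1:]):
        p *= P[a][b]
    return p


def conditional_entropy(k, pi, P):
    """Entropy of the next symbol given the k previous symbols, in nats."""
    states = range(len(pi))
    H = 0.0
    for block in product(states, repeat=k + 1):
        p = block_prob(block, pi, P)
        if p > 0:
            prefix = block_prob(block[:-1], pi, P) if k > 0 else 1.0
            H -= p * log(p / prefix)   # p / prefix is the conditional probability
    return H


# Hypothetical example chain with stationary distribution (2/3, 1/3).
pi = [2 / 3, 1 / 3]
P = [[0.9, 0.1], [0.2, 0.8]]
H0, H1, H2 = (conditional_entropy(k, pi, P) for k in range(3))
```

For this order-1 chain the value at $k = 0$ is strictly larger than the value at $k = 1$, and the values at $k = 1$ and $k = 2$ agree up to rounding, as the characterization above predicts.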

The conditional $k$-th order empirical entropy $\hat h_k(n)$ is defined by replacing theoretical
probabilities by the corresponding empirical probabilities. The ergodic theorem
implies that for $k$ fixed, $\hat P_n(a_1^{k+1}) \to P(a_1^{k+1})$ and $\hat P_n(a_{k+1} \mid a_1^k) \to P(a_{k+1} \mid a_1^k)$, each
with probability 1, and hence that $\hat h_k(n) \to H_k$, a.s. Furthermore, in the Markov case
we have the following iterated logarithm result.

Lemma 1 If $X$ is Markov of finite order $M$ then for each $k$ there is a constant $c_k$
such that

$$|H_k - \hat h_k(n)| < c_k \sqrt{\frac{\log\log n}{n}}, \quad \text{eventually a.s. as } n \to \infty.$$



Remark. A slightly weaker inequality (which would suffice for our application here),
with an extra factor of $\log n$ on the right-hand side, can be obtained by applying [3,
Theorem 16.3.2] instead of (4) below.

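Proof of Lemma 1. Let $\Psi(x) = x \log x - x + 1$, so that $\Psi(1) = \Psi'(1) = 0$. For
$x > 1/2$ we have $\Psi''(x) = 1/x < 2$, whence $|\Psi(x)| \le (x - 1)^2$ for all $x > 1/2$.

Consider two distributions $P$ and $Q$ on the same alphabet $A$, and suppose that

$$\gamma = \max_a \left| \frac{P(a)}{Q(a)} - 1 \right| < 1/2. \tag{3}$$

Then the divergence $D(P \| Q) = \sum_a P(a) \log \frac{P(a)}{Q(a)}$ satisfies

$$D(P \| Q) = \sum_a \Bigl[ P(a) \log \frac{P(a)}{Q(a)} + Q(a) - P(a) \Bigr] = \sum_a Q(a)\, \Psi\!\Bigl( \frac{P(a)}{Q(a)} \Bigr) \le \gamma^2.$$

Moreover,

$$\Bigl| \sum_a (P(a) - Q(a)) \log Q(a) \Bigr| \le \gamma \sum_a Q(a)\, |\log Q(a)| = \gamma H(Q) \le \gamma \log |A|.$$

Since $H(Q) - H(P) = D(P \| Q) + \sum_a (P(a) - Q(a)) \log Q(a)$, adding the last two inequalities (using positivity of the divergence) gives

$$|H(Q) - H(P)| \le \gamma^2 + \gamma \log |A| \tag{4}$$

under the assumption (3).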
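A quick numerical check of the entropy bound derived in this proof can be reassuring. The Python sketch below evaluates both sides for one pair of distributions; the function names and the example distributions are illustrative assumptions, and natural logarithms are used.

```python
from math import log


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(q * log(q) for q in p if q > 0)


def entropy_gap_and_bound(P, Q):
    """Return (|H(Q) - H(P)|, gamma^2 + gamma * log|A|) for distributions P and Q."""
    gamma = max(abs(p / q - 1) for p, q in zip(P, Q))
    if gamma >= 0.5:
        raise ValueError("the bound is only claimed when gamma < 1/2")
    gap = abs(entropy(Q) - entropy(P))
    bound = gamma ** 2 + gamma * log(len(P))
    return gap, bound


# Hypothetical example on a three-letter alphabet.
gap, bound = entropy_gap_and_bound([0.5, 0.3, 0.2], [0.45, 0.33, 0.22])
```

In this example the entropy gap is well below the bound, which is expected since the bound is not designed to be tight.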

By the law of the iterated logarithm for finite-order Markov chains, there is a
constant $c_k$ such that

$$\left| \frac{\hat P_n(a_1^k)}{P(a_1^k)} - 1 \right| < c_k \sqrt{\frac{\log\log n}{n}}, \quad \text{eventually a.s.,}$$

so an application of (4) to $\hat P_n$ and $P$ proves the lemma. □

The Ornstein-Weiss recurrence theorem, [14], states that for any ergodic finite
alphabet process $X$, the time until the opening $n$-block occurs again,

$$R_n(x) = \min\{r \ge n : x_{r+1}^{r+n} = x_1^n\},$$

grows like $e^{nH(X)}$, that is, $(1/n) \log R_n(x) \to H(X)$ a.s. (Earlier, Wyner and Ziv, [T9|,
established convergence-in-probability for a related recurrence idea.) In our setting
$\ell(n) = \max\{k : R_k \le n\}$ and the Ornstein-Weiss recurrence theorem gives

$$\lim_{n\to\infty} \frac{1}{\ell(n)} \log R_{\ell(n)}(x) = H(X), \quad \text{a.s.}$$
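The recurrence-based entropy estimate is simple to compute from a finite sample. The following Python sketch follows the informal description "longest initial block that repeats in the sample": it looks for the largest $k$ such that the first $k$ symbols occur again, entirely inside the sample, at a shift of at least $k$. The non-overlap convention at the boundary and the function names are assumptions made for illustration.

```python
from math import inf, log


def longest_repeating_prefix(x):
    """Largest k such that x[:k] occurs again in x at some shift r >= k (0 if none)."""
    n = len(x)
    best = 0
    for k in range(1, n // 2 + 1):
        prefix = x[:k]
        if any(x[r:r + k] == prefix for r in range(k, n - k + 1)):
            best = k
        else:
            break   # if the length-k prefix does not repeat, longer prefixes cannot
    return best


def recurrence_entropy_estimate(x):
    """log(n) divided by the longest repeating prefix length, in nats."""
    ell = longest_repeating_prefix(x)
    return log(len(x)) / ell if ell > 0 else inf
```

The early exit relies on monotonicity: a repeat of the length-$(k+1)$ prefix at shift $r \ge k+1$ contains a repeat of the length-$k$ prefix at the same shift.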

Let $\mathcal{M}$ denote the set of ergodic, $A$-valued processes $X$ that are finite-order Markov.
To obtain a rate of convergence for $X \in \mathcal{M}$ we use Kontoyiannis' second-order result,
P21 Corollary 1], that for any $\beta > 0$ and $X \in \mathcal{M}$,

$$\log[R_n(x)\, P(x_1^n)] = o(n^\beta), \quad \text{a.s.} \tag{5}$$

The statement and proof were for Wyner-Ziv recurrence but can easily be adapted
to Ornstein-Weiss recurrence. We use it to prove

Lemma 2 For all $\beta > 1/2$ and $X \in \mathcal{M}$, $\log R_n(x) = nH + o(n^\beta)$, a.s.

Proof. Suppose $X$ has order $M$ and $\beta > 1/2$. The Markov property and the law of
the iterated logarithm yield

$$\begin{aligned}
\log P(x_1^n) &= \log P(x_1^M) + \sum_{a_1^{M+1}} N_n(a_1^{M+1}) \log P(a_{M+1} \mid a_1^M) \\
&= (n - M) \sum_{a_1^{M+1}} P(a_1^{M+1}) \log P(a_{M+1} \mid a_1^M) + o(n^\beta) \\
&= -nH + o(n^\beta), \quad \text{a.s.,}
\end{aligned}$$

which, combined with (5), yields the lemma. □

Lemma 3 For all $X \in \mathcal{M}$,

$$\frac{1}{\ell(n)} \log n \to H(X), \quad \text{a.s.}$$

Proof. Since $R_{\ell(n)}(x) \le n < R_{\ell(n)+1}(x)$, the lemma follows from

$$\frac{1}{\ell(n)} \log R_{\ell(n)}(x) \le \frac{1}{\ell(n)} \log n < \frac{1}{\ell(n)} \log R_{\ell(n)+1}(x)$$

and the fact that both the left-hand and right-hand terms go to $H(X)$, a.s. □

We also need a lower bound on the growth of $\ell(n)$.

Lemma 4 For any $X \in \mathcal{M}$ there is a constant $C > 0$ such that
$\ell(n) \ge C \log n$, eventually a.s.

Proof. By the Ornstein-Weiss recurrence theorem,

$$R_k \le e^{k(H+1)} \le e^{k(1 + \log |A|)}, \quad \text{eventually a.s.}$$

Thus we can take $C = (1 + \log |A|)^{-1}$. □

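The lemmas yield

Proposition 1 For any $X \in \mathcal{M}$,

$$\frac{1}{\ell(n)} \log n \ge H(X) - (\log n)^{-1/4}, \quad \text{eventually a.s.}$$

Proof. The following chain of inequalities holds eventually a.s.

$$\frac{1}{\ell(n)} \log n \overset{(a)}{\ge} \frac{1}{\ell(n)} \log R_{\ell(n)} \overset{(b)}{\ge} H(X) - \frac{1}{\ell(n)^{3/8}} \overset{(c)}{\ge} H(X) - \frac{1}{[C \log n]^{3/8}} \overset{(d)}{\ge} H(X) - \frac{1}{(\log n)^{1/4}},$$

where inequality (a) holds since $R_{\ell(n)}(x) \le n$ (see the proof of Lemma 3), inequality (b) follows from Lemma 2 with $\beta = 5/8$, and inequality
(c) from Lemma 4, while inequality (d) is clear. □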

We are now ready to prove Theorem 1, which for ease of reference we restate here.

Theorem 1 $M_n^*(x_1^n) = \min\{k : \hat h_k(n) < [\ell(n)]^{-1} \log n + 2(\log n)^{-1/4}\}$ is a consistent
Markov order estimator.

Proof. Suppose $X \in \mathcal{M}$ has order $M$ and entropy $H = H(X)$. We first show that
underestimation does not occur, eventually a.s. For $m < M$, the simple facts

(a) $\hat h_m(n) \to H_m$ as $n \to \infty$ and $H_m > H$,

(b) $[\ell(n)]^{-1} \log n \to H$ and $2(\log n)^{-1/4} \to 0$,

immediately imply that $\hat h_m(n) > [\ell(n)]^{-1} \log n + 2(\log n)^{-1/4}$, eventually a.s.
To rule out overestimation, note that the following chain of inequalities holds eventually a.s.

$$\hat h_M(n) \overset{(a)}{\le} H + c \sqrt{\frac{\log\log n}{n}} \overset{(b)}{\le} \frac{1}{\ell(n)} \log n + \frac{1}{(\log n)^{1/4}} + c \sqrt{\frac{\log\log n}{n}} \overset{(c)}{\le} \frac{1}{\ell(n)} \log n + \frac{2}{(\log n)^{1/4}},$$

inequality (a) by Lemma 1 and the fact that $H_M = H$, and inequality (b) by Proposition 1,
while inequality (c) is obvious. We conclude that $M_n^*(x_1^n) = M$, eventually
a.s. □
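The entropy estimator method can be put together in a few lines. The Python sketch below is a hedged illustration, not the authors' code: all function names are hypothetical, natural logarithms are used, repeats of the initial block are required to be non-overlapping and fully inside the sample (an assumption), and the optional `margin` argument is an addition that lets one replace the default term $2(\log n)^{-1/4}$ for experimentation.

```python
from collections import Counter
from math import inf, log


def block_counts(x, k):
    """Count overlapping length-k blocks of x."""
    return Counter(tuple(x[i:i + k]) for i in range(len(x) - k + 1))


def empirical_conditional_entropy(x, k):
    """k-step conditional empirical entropy of x, in nats."""
    n = len(x)
    joint = block_counts(x, k + 1)
    context = block_counts(x[:-1], k)
    return -sum(c / (n - k) * log(c / context[b[:-1]]) for b, c in joint.items())


def longest_repeating_prefix(x):
    """Largest k such that x[:k] recurs inside x at a shift r >= k (0 if none)."""
    n, best = len(x), 0
    for k in range(1, n // 2 + 1):
        if any(x[r:r + k] == x[:k] for r in range(k, n - k + 1)):
            best = k
        else:
            break
    return best


def entropy_order_estimate(x, margin=None):
    """First k whose empirical entropy falls below the recurrence estimate plus a margin."""
    n = len(x)
    ell = longest_repeating_prefix(x)
    recurrence_estimate = log(n) / ell if ell > 0 else inf
    if margin is None:
        margin = 2 * log(n) ** (-0.25)     # the term used in Theorem 1
    threshold = recurrence_estimate + margin
    for k in range(n - 1):
        if empirical_conditional_entropy(x, k) < threshold:
            return k
    return None
```

A caveat worth keeping in mind when experimenting: for a binary alphabet the default margin exceeds $\log 2$ until $\log n$ is larger than about 69, so at realistic sample sizes the rule with the default margin returns 0. This is consistent with consistency being an asymptotic statement; a smaller `margin` is useful only to see the mechanics.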



As the above proof suggests, the entropy estimator $[\ell(n)]^{-1} \log n$ can be replaced
by any consistent entropy estimator $\hat H(x_1^n)$ that has an $o(1)$ underestimation bound,
i.e., a function $u(n) \to 0$ such that for all $X \in \mathcal{M}$

$$\hat H(x_1^n) \ge H(X) - u(n), \quad \text{eventually a.s.,} \tag{6}$$

provided we replace $2(\log n)^{-1/4}$ by $|u(n)| + (1/n) \log n$.

Theorem 1 (General form)

Let $\hat H(x_1^n)$ be a consistent entropy estimator with $o(1)$ underestimation bound $u(n)$.
Then $M_n^*(x_1^n) \stackrel{\mathrm{def}}{=} \min\{k : \hat h_k(n) < \hat H(x_1^n) + |u(n)| + (1/n) \log n\}$ is a consistent Markov
order estimator.

We used the recurrence-based entropy estimator as it is one of the simplest to
describe and compute, it easily updates as $n$ increases, and its second order properties
are easy to determine. Its underestimation bound $(\log n)^{-1/4}$ goes to 0 very slowly,
however, which suggests that its associated order estimator $M_n^*(x_1^n)$ converges slowly
to $M$. Furthermore, though the recurrence idea does generalize to higher dimensions,
see [12], a useful rate theory for it has not been established. In Section 4.1 we present
another entropy estimator that has a more rapidly convergent underestimation bound
and is extendable to higher dimensions.



3 The maximal fluctuation method. 

We now prove the second theorem stated in the introduction, namely,

Theorem 2 $M_n^*(x_1^n) = \min\{m < n - f(n) : \phi_m(x_1^n) < n^{3/4}\}$ is a consistent Markov
order estimator. (Recall that we defined $M_n^*(x_1^n) = n$ if this set is empty.)

Proof. Let

$$\delta_m(a_1^k \mid x_1^n) = N_n(a_1^k) - N_{n-1}(a_1^{k-1})\, \hat P_n(a_k \mid a_{k-m}^{k-1})$$

and note that

$$\phi_m(x_1^n) = \max_{m < k < f(n)} \; \max_{a_1^k} \delta_m(a_1^k \mid x_1^n), \tag{7}$$

where $f(n) = \log\log n$. We first show that eventually a.s. underestimation does not
occur. Suppose $X \in \mathcal{M}$ has order $M$ and $m < M$. Choose $a_1^{M+1}$ such that

$$P(a_{M+1} \mid a_1^M) > P(a_{M+1} \mid a_{M-m+1}^M).$$

By the ergodic theorem there exists $\epsilon > 0$ such that, eventually a.s.,

$$N_{n-1}(a_1^M) \ge \epsilon n \quad\text{and}\quad \hat P_n(a_{M+1} \mid a_1^M) - \hat P_n(a_{M+1} \mid a_{M-m+1}^M) \ge \epsilon.$$

This implies that $\phi_m(x_1^n) \ge \epsilon^2 n$ and hence that $M_n^*(x_1^n) \ge M$, eventually a.s.

It takes somewhat more effort to show that, eventually a.s., $\phi_M(x_1^n) < n^{3/4}$. We
first note that for fixed $k > M$,

$$Z_k(n) = N_n(a_1^k) - N_{n-1}(a_1^{k-1})\, P(a_k \mid a_{k-M}^{k-1}), \quad n \ge k, \tag{8}$$

is a martingale with bounded differences. Indeed, with $\chi(B)$ denoting the indicator
of $B$, we can write $Z_k(n) = \sum_{j=k}^{n} \Delta_k(j)$, where

$$\Delta_k(j) = \chi(X_{j-k+1}^{j} = a_1^k) - \chi(X_{j-k+1}^{j-1} = a_1^{k-1})\, P(a_k \mid a_{k-M}^{k-1}),$$

and direct calculation shows that $E(\Delta_k(j) \mid X_1^{j-1}) = 0$ and $\|\Delta_k(j)\|_\infty \le 1$ for $j \ge k$.
From the Hoeffding-Azuma large deviations bound for martingales with bounded
differences, [TT1 IT], the probability that $|Z_k(n)| > n^{3/4}$ is at most $2 \exp(-n^{1/2}/2)$.
A similar argument also shows that for

$$Z_k^*(n) \stackrel{\mathrm{def}}{=} N_n(a_{k-M}^{k}) - N_{n-1}(a_{k-M}^{k-1})\, P(a_k \mid a_{k-M}^{k-1}), \quad n \ge k,$$

the probability that $|Z_k^*(n)| > n^{3/4}$ is at most $2 \exp(-n^{1/2}/2)$.
Next we note that

$$Z_k(n) - \delta_M(a_1^k \mid x_1^n) = \frac{N_{n-1}(a_1^{k-1})}{N_{n-1}(a_{k-M}^{k-1})}\, Z_k^*(n),$$

which has absolute value at most $|Z_k^*(n)|$. Thus, the probability that $\delta_M(a_1^k \mid x_1^n) > 2n^{3/4}$
is less than $4 \exp(-n^{1/2}/2)$. Since there are at most $|A|^{f(n)+1} = |A| (\log n)^{\log |A|}$ possible
sequences $a_1^k$, it follows from (7) and an application of Borel-Cantelli that eventually
a.s., $\phi_M(x_1^n) < n^{3/4}$. This completes the proof of Theorem 2. □
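The Borel-Cantelli step at the end of this proof rests on the fact that the number of blocks grows only polylogarithmically while each tail probability decays like a stretched exponential. The Python sketch below evaluates the resulting union bound numerically; the function name is hypothetical and the constant 4 and the exponent come from the bounds quoted in the proof.

```python
from math import exp, log, sqrt


def fluctuation_union_bound(n, alphabet_size):
    """Union bound: (number of blocks) x (per-block tail probability)."""
    f = log(log(n))                           # cutoff f(n) = log log n
    number_of_blocks = alphabet_size ** (f + 1)
    per_block_tail = 4 * exp(-sqrt(n) / 2)
    return number_of_blocks * per_block_tail


# Hypothetical sample sizes for a binary alphabet.
bounds = [fluctuation_union_bound(n, 2) for n in (10**3, 10**4, 10**5)]
```

The values decrease extremely fast in $n$, so their sum over $n$ is finite, which is what the Borel-Cantelli lemma needs.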

Remark 1 After one of us lectured on these results, B. Weiss noted that in
recent joint work with G. Morvai, they independently developed the estimator
$M^*$ discussed in Theorem 2.

3.1 Markov Random Fields 

The method of maximum fluctuations extends in modified form to Markov random 
fields, where order is usually called range. We confine our discussion to the two 
dimensional (2-d) case; the extension to higher dimensions is straightforward. 

We use the following notation.

1. $S_t \stackrel{\mathrm{def}}{=} \{(i, j) : -t \le i \le t,\ -t \le j \le t\}$ = the square of width $2t + 1$, centered at
the origin. (Note that $S_{t+s} \setminus S_t$ is a square "annulus" of thickness $s$.)

2. $S_t(u) \stackrel{\mathrm{def}}{=}$ the square of width $2t + 1$ with center at $u \in \mathbb{Z}^2$.

3. $\Lambda_n \stackrel{\mathrm{def}}{=}$ the square of width $n$ with lower left corner at $(1,1)$.

4. A configuration $a(\Lambda)$ is a function $a\colon \Lambda \to A$; if no confusion results its restriction
to $\Lambda' \subset \Lambda$ will be denoted by $a(\Lambda')$.

A random field is a collection $X = \{X(u) : u \in \mathbb{Z}^2\}$ of random variables with
values in $A$. Unless stated otherwise, random fields are assumed to be stationary and
ergodic. We use the conditional probability notation

$$P(a(\Lambda) \mid b(\Lambda')) \stackrel{\mathrm{def}}{=} \frac{\mathrm{Prob}(X(\Lambda) = a(\Lambda),\ X(\Lambda') = b(\Lambda'))}{\mathrm{Prob}(X(\Lambda') = b(\Lambda'))}.$$

A random field is said to be Markov with range $R = 0$ if it is i.i.d., and Markov with
range $R \ge 1$ if $R$ is the least positive integer $r$ such that for all $\ell \ge 0$ and $t \ge 0$

$$P(a(S_\ell) \mid b(S_{\ell+r+t} \setminus S_\ell)) = P(a(S_\ell) \mid b(S_{\ell+r} \setminus S_\ell)),$$

for all configurations $a(S_\ell)$ and $b(S_{\ell+r+t} \setminus S_\ell)$. That is, $R$ is the least $r$ such that
the random variables $X(S_\ell)$ on the inner square and $X(S_{\ell+r+t} \setminus S_{\ell+r})$ on the outer
annulus are conditionally independent, given the values $X(S_{\ell+r} \setminus S_\ell)$ on the inner
annulus. The range of a finite-range random field $X$ is denoted by $R = R(X)$.

Our 2-d maximum fluctuation method tests whether configurations on a square
are conditionally independent of those outside a square that is expanded by $r$ in each
axis direction, given the configuration in the annulus between the two squares. Not
only do we need to test over a (slowly) growing interval of possible orders $r$, but now
we also need to examine a (slowly) growing interval of sizes $\ell$ for the inner square,
as order can depend on square size, though it eventually becomes constant as square
size increases. Counting overlapping blocks as in (1) will not be used because the
higher dimensional analogue of (8) need not be a martingale. We focus instead on
counting nonoverlapping blocks, to which classical large deviations is applicable, but
now we must also consider translates.

Given $n \ge 8$, let $\ell$, $r$, and $t$ be integers in the closed interval $[0, \log\log n]$ and put

$$k \stackrel{\mathrm{def}}{=} \ell + r + t, \quad\text{and}\quad T \stackrel{\mathrm{def}}{=} \left\lfloor \frac{n}{2k + 1} \right\rfloor - 1.$$

We assume the integer $n$ is large enough to guarantee that $T > 0$ for all $k \le 3 \log\log n$.
Let

$$\Pi_k = \{S_k(u_1), S_k(u_2), \ldots, S_k(u_{T^2})\}$$

be the partition of the square $\Lambda_{(2k+1)T}$ into squares of width $2k+1$. For each $v \in \Lambda_{2k+1}$,
let $\Pi_k(v) = \{S_k(v + u_j),\ 1 \le j \le T^2\}$ be the translated partition of the square
$v + \Lambda_{(2k+1)T} \subseteq \Lambda_n$.
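The geometry of the partition and its translates is easy to get wrong by one, so a small sketch may help. The Python code below lists the centers of the squares of width $2k+1$ that tile the square of width $(2k+1)T$ with lower left corner at $(1,1)$, and computes the extent of a translated square. It assumes $T$ is the integer part of $n/(2k+1)$ minus one; the function names are hypothetical.

```python
def partition_centers(n, k):
    """Return T and the centers of the T*T squares of width 2k+1 tiling the corner square."""
    width = 2 * k + 1
    T = n // width - 1
    if T <= 0:
        raise ValueError("n is too small for this k")
    centers = [(k + 1 + width * i, k + 1 + width * j)
               for i in range(T) for j in range(T)]
    return T, centers


def translated_square(center, v, k):
    """Inclusive coordinate ranges of the width-(2k+1) square centered at center + v."""
    cx, cy = center[0] + v[0], center[1] + v[1]
    return (cx - k, cx + k), (cy - k, cy + k)


# Hypothetical example: n = 50 and k = 2, translated by the largest allowed vector.
T, centers = partition_centers(50, 2)
far_corner = translated_square(centers[-1], (5, 5), 2)
```

With these choices the farthest translated square ends exactly at coordinate 50, so every translate stays inside the observed window; subtracting one in the definition of $T$ is what leaves room for the translation.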




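Given a configuration $x(\Lambda_n)$ and a configuration $a(\Lambda)$ on a centrally symmetric
subset $\Lambda \subset S_k$, and given a vector $v \in \Lambda_{2k+1}$, put

$$N_v(a(\Lambda)) = N_v(a(\Lambda) \mid x(\Lambda_n)) \stackrel{\mathrm{def}}{=} \bigl|\{j \in [1, T^2] : x(u_j + v + w) = a(w),\ \forall w \in \Lambda\}\bigr|,$$

that is, the number of times the configuration $a(\Lambda)$ appears in $x(\cdot)$, centered at a
member of the translated partition $\Pi_k(v)$. Our 2-d test function is

$$\delta_{\ell,r,t,v}(a(S_k) \mid x(\Lambda_n)) \stackrel{\mathrm{def}}{=} N_v(a(S_k)) - N_v(a(S_k \setminus S_\ell))\, \frac{N_v(a(S_{\ell+r}))}{N_v(a(S_{\ell+r} \setminus S_\ell))}. \tag{9}$$

This is maximized over configurations $a(S_k)$ and translates $v$ to produce

$$\delta_{\ell,r,t}(x(\Lambda_n)) \stackrel{\mathrm{def}}{=} \max_{a(S_k)} \; \max_{v \in \Lambda_{2k+1}} \delta_{\ell,r,t,v}(a(S_k) \mid x(\Lambda_n)).$$

For $\ell = \lfloor \log\log n \rfloor$ define

$$\phi_r(x(\Lambda_n)) \stackrel{\mathrm{def}}{=} \max_{0 \le t \le \log\log n} \delta_{\ell,r,t}(x(\Lambda_n)).$$

Our 2-d order estimator is

$$R_n^*(x(\Lambda_n)) \stackrel{\mathrm{def}}{=} \min\{r < n - 3 \log\log n : \phi_r < n^{3/2}\},$$

where, if there is no such $r < n - 3 \log\log n$, we set $R_n^*(x(\Lambda_n)) = n$.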

Theorem 3 Let $X$ be a stationary, ergodic, finite range random field on $\mathbb{Z}^d$. Then
$R_n^*(x(\Lambda_n)) = R(X)$, eventually a.s.

Proof. If $r < R = R(X)$, an argument similar to the 1-dimensional case shows
that $\phi_r(x(\Lambda_n)) \ge Cn^2$, eventually a.s., for some $C > 0$. Thus, underestimation
eventually a.s. does not occur.

To complete the proof it is enough to show that $\phi_R < n^{3/2}$, eventually a.s. Towards
this end, we fix $\ell \ge 0$ and $t \ge 0$, put $r = R$ and $k = \ell + R + t$, fix $a(S_k)$ and $v \in \Lambda_{2k+1}$,
and put $N = N_v$. Our 2-d test function (9) can then be expressed as the sum

$$N(a(S_k)) - N(a(S_k \setminus S_\ell))\, \frac{N(a(S_{\ell+R}))}{N(a(S_{\ell+R} \setminus S_\ell))} = \Delta_1 + \Delta_2,$$

where

$$\Delta_1 = N(a(S_k)) - N(a(S_k \setminus S_\ell))\, P(a(S_\ell) \mid a(S_{\ell+R} \setminus S_\ell)),$$

and

$$\begin{aligned}
|\Delta_2| &= \frac{N(a(S_k \setminus S_\ell))}{N(a(S_{\ell+R} \setminus S_\ell))}\, \bigl| N(a(S_{\ell+R} \setminus S_\ell))\, P(a(S_\ell) \mid a(S_{\ell+R} \setminus S_\ell)) - N(a(S_{\ell+R})) \bigr| \\
&\le \bigl| N(a(S_{\ell+R} \setminus S_\ell))\, P(a(S_\ell) \mid a(S_{\ell+R} \setminus S_\ell)) - N(a(S_{\ell+R})) \bigr|.
\end{aligned} \tag{10}$$

Denote $p = P(a(S_\ell) \mid a(S_{\ell+R} \setminus S_\ell))$ and $w_j = u_j + v$. Then we can write $\Delta_1 = \sum_{j=1}^{T^2} \Delta_{1,j}$, where with $\chi(\cdot)$ denoting the indicator function,

$$\Delta_{1,j} \stackrel{\mathrm{def}}{=} \chi\bigl(x(S_k(w_j)) = a(S_k)\bigr) - \chi\bigl(x([S_k \setminus S_\ell](w_j)) = a(S_k \setminus S_\ell)\bigr)\, p.$$

Therefore, conditioned on the values $a(S_k \setminus S_\ell)$ in the square annulus, $\Delta_1$ is a sum
of $N(a(S_k \setminus S_\ell)) \le T^2$ binary i.i.d. mean 0 random variables. The classical Hoeffding
large deviations bound, [TT], implies that the probability that $|\Delta_1| > \tfrac{1}{2} n^{3/2}$ is at
most $2\exp(-n/4)$. The inequality (10) implies that the same result holds for $|\Delta_2|$.
Since there are only subexponentially many $a(S_k \setminus S_\ell)$ and $v \in \Lambda_{2k+1}$ to consider,
the Borel-Cantelli lemma implies that $\phi_R < n^{3/2}$, eventually a.s. This completes the
proof of Theorem 3. □



Remark 2 To simplify the discussion we focused on squares rather than diamonds 
which are more natural in Ising models. Our concepts and results can easily be con- 
verted to the latter setting. 

Remark 3 Csiszár and Talata, |?|/, have recently shown the existence of a consistent
range estimator for a restricted class of Markov random fields, namely, those for
which, conditioned on any boundary, probabilities in a square are positive, a condition
that allows them to focus only on squares of size 1, rather than squares of growing
size as we did. They assume no bound on the range and use a variant of the BIC in
which maximum likelihood is replaced by maximum pseudolikelihood.

4 Extensions and related work.



4.1 Other entropy estimators. 

There are many known consistent entropy estimators, for most of which $o(1)$
underestimation bounds for the Markov case have not been established. In addition
to the recurrence estimator, such an underestimation bound can be shown to hold for
$\hat h_{f(n)}(n)$, where for example $f(n) = \frac{\log\log n}{\log |A|}$.

Proposition 2 There is a positive constant $C$ such that for any $X \in \mathcal{M}$,

(a) $\hat h_{f(n)}(n) \to H(X)$, a.s.

(b) $\hat h_{f(n)}(n) \ge H(X) - C\, \frac{\log^2 n}{n}$, eventually a.s.



Proof. By the Ornstein-Weiss entropy estimation theorem, ^3], the per-symbol empirical
block entropy $\frac{1}{f(n)} H(\hat P_{f(n)}(\cdot)) \to H$, a.s. as $f(n) \to \infty$, provided only that
$f(n) \le \frac{\log n}{H + \epsilon}$, for some $\epsilon > 0$. It is easy to see that this implies $\hat h_{f(n)}(n) \to H$, a.s., for
the case $f(n) = \frac{\log\log n}{\log |A|}$. This proves (a).

To establish part (b), suppose $X \in \mathcal{M}$ has order $M$. The BIC consistency theorem
implies that

$$\frac{|A|^{f(n)}(|A| - 1)}{2} \log n + n \hat h_{f(n)}(n) \ge \frac{|A|^{M}(|A| - 1)}{2} \log n + n \hat h_M(n),$$

eventually a.s. Using the relation $|A|^{f(n)} = \log n$ and the bound $n \hat h_M(n) \ge nH - c \log\log n$,
which holds eventually a.s. by Lemma 1, we obtain

$$n \hat h_{f(n)}(n) \ge nH - c \log\log n + \frac{|A|^{M}(|A| - 1)}{2} \log n - \frac{|A| - 1}{2} \log^2 n,$$

from which (b) follows. □

Remark 4 The empirical entropy estimator $\hat h_{f(n)}(n)$ converges to entropy faster than
the recurrence-based estimator, which is not surprising as the latter uses so little about
the sample path. We suspect there may be a more direct proof of Proposition 2(b) than
the one we gave.

Remark 5 An important example for which an $o(1)$ underestimation bound is not
known is the Lempel-Ziv entropy estimator. An $O((1/n) \log n)$ underestimation
bound for the class $\mathcal{M}_0$ of i.i.d. processes has been established, see 0/, a result we
suspect can be extended to the class $\mathcal{M}$.

4.2 The "flat spot" problem.

For the Markov order estimation problem, it is tempting to take as order estimator
the first $k$ for which $\hat h_k(n) - \hat h_{k+1}(n) < n^{-1/4}$. This eventually a.s. gets stuck at the
first $k$ for which $H_k = H_{k+1}$. Such flat spots can occur for $k < M - 1$. This shows,
incidentally, why we needed to take the maximum over a growing interval of possible
orders in the definition (2) of our maximal fluctuation test function.
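A flat spot is easy to produce by simulation. The Python sketch below builds a hypothetical order-2 binary chain in which each symbol copies the symbol two steps back, except that it flips with probability 0.1; the two interleaved subsequences are then independent, so knowing only the previous symbol does not help to predict the next one. The construction, the seed, and the function names are illustrative assumptions.

```python
import random
from collections import Counter
from math import log


def empirical_conditional_entropy(x, k):
    """k-step conditional empirical entropy of x, in nats."""
    n = len(x)
    joint = Counter(tuple(x[i:i + k + 1]) for i in range(n - k))
    context = Counter(tuple(x[i:i + k]) for i in range(n - k))
    return -sum(c / (n - k) * log(c / context[b[:-1]]) for b, c in joint.items())


def interleaved_chain(n, flip=0.1, seed=0):
    """Order-2 chain: each symbol copies the one two steps back, flipped with prob. `flip`."""
    rng = random.Random(seed)
    x = [rng.randint(0, 1), rng.randint(0, 1)]
    while len(x) < n:
        previous = x[-2]
        x.append(1 - previous if rng.random() < flip else previous)
    return x


x = interleaved_chain(40000)
h0, h1, h2 = (empirical_conditional_entropy(x, k) for k in range(3))
```

Here the order-0 and order-1 empirical entropies nearly coincide while the order-2 value is much smaller, so a rule that stops at the first small drop would stop at 0 even though the order is 2.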

Remark 6 The "no flat spot" case is "generic", for it is easy to see that in the usual
parametrization of the set of $X \in \mathcal{M}$ of order $M$ as a subset of $|A|^M(|A| - 1)$-dimensional
Euclidean space, the set of $X$ of order $M$ whose conditional entropy has
flat spots before $M$ has Lebesgue measure 0. This is a good example where genericity
is not an interesting concept.

4.3 The BIC, MDL, and related methods. 

Two important and related methods, the Bayesian Information Criterion (BIC)
and the Minimum Description Length (MDL) Principle, are the basis for many model
selection methods, see [21 IH El for discussion and references to these and other methods.
Both the BIC and the MDL focus on selecting the correct class from a nested
sequence of parametric model classes, $\mathcal{M}_0 \subset \mathcal{M}_1 \subset \mathcal{M}_2 \subset \cdots$, based on a sample path
drawn from some $P \in \bigcup_k \mathcal{M}_k$.

The BIC, introduced by Schwarz [TE], is based on Bayesian principles and leads
to the model estimator

$$M_{\mathrm{BIC}}(x_1^n) = \arg\min_k \Bigl( -\log P_{\mathrm{ML}(k)}(x_1^n) + \frac{\phi(k)}{2} \log n \Bigr),$$

where $P_{\mathrm{ML}(k)}(x_1^n)$ is the $k$-th order maximum likelihood, i.e., the largest probability
given to $x_1^n$ by distributions in $\mathcal{M}_k$, and $\phi(k)$ is the number of free parameters
needed to describe members of $\mathcal{M}_k$. For the Markov order estimation problem, $\mathcal{M}_k =
\{X \in \mathcal{M} : M(X) \le k\}$, $-\log P_{\mathrm{ML}(k)}(x_1^n) = (n - k)\hat h_k(n)$, and $\phi(k) = |A|^k(|A| - 1)$.
Schwarz ^Hl proved consistency if the model classes are i.i.d. exponential families and
a bound on the number of models is assumed, a result later extended to the Markov
case by Finesso [TOj. The first consistency proofs for the Markov case without an
order bound assumption are surprisingly complicated,
though they have been simplified somewhat in 0], which focuses on MDL consistency.
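For comparison with the two estimators of this paper, the BIC rule for Markov order can be sketched as follows. This is a hedged illustration: the function names are hypothetical, the alphabet size is taken to be the number of distinct symbols observed, the search is capped at a user-supplied `k_max`, and the penalty uses half the number of free parameters times $\log n$, which is the standard BIC form and is assumed here.

```python
from collections import Counter
from math import log


def empirical_conditional_entropy(x, k):
    """k-step conditional empirical entropy of x, in nats."""
    n = len(x)
    joint = Counter(tuple(x[i:i + k + 1]) for i in range(n - k))
    context = Counter(tuple(x[i:i + k]) for i in range(n - k))
    return -sum(c / (n - k) * log(c / context[b[:-1]]) for b, c in joint.items())


def bic_order_estimate(x, k_max):
    """Order k in 0..k_max minimizing negative log-likelihood plus the BIC penalty."""
    n = len(x)
    alphabet_size = len(set(x))

    def score(k):
        neg_log_likelihood = (n - k) * empirical_conditional_entropy(x, k)
        free_parameters = alphabet_size ** k * (alphabet_size - 1)
        return neg_log_likelihood + free_parameters / 2 * log(n)

    return min(range(k_max + 1), key=score)
```

On a strictly alternating binary sequence the likelihood term vanishes from order 1 onward, so the penalty selects order 1.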

The MDL principle, introduced by Rissanen (see [2]), is based on universal coding
ideas. For each $k \le n$, the sequence $x_1^n$ is encoded using a binary code that is
"optimal" for the class $\mathcal{M}_k$ and the model that has the shortest code length is chosen,
that is,

$$M_{\mathrm{MDL}}(x_1^n) = \arg\min_k L_k(x_1^n), \tag{11}$$

where $L_k(x_1^n)$ is the length of the code word assigned to $x_1^n$. Different concepts
of "optimal" lead to different estimators. For a discussion of consistency for such
estimators without a prior order bound, see [U] and jlj.